{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MLlib: Clustering and Dimensionality Reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Introduction to Spark with Python, by Jose A. Dianes](http://jadianes.github.io/spark-py-notebooks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will use Spark's machine learning library [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html) to perform [**K-means Clustering**](https://en.wikipedia.org/wiki/K-means_clustering) over our network attacks datasets. We will use the complete [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) datasets in order to test Spark capabilities with large datasets. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With k-means clustering we aim to partition our `n` observations (i.e. network interactions in our dataset) into `k` clusters in which each observation belongs to the cluster with the nearest mean. An important part of using k-Means clustering is to determine the right value for `k` or number of clusters. In order to do that we will sample our dataset and iterate through different possible values. Once we have one that works well, we will use it to cluster the complete dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have our data clustered, we will do two things. First we will use dimensionality reduction to obtain a two dimensional dataset that we can represent visually. Then we will use our knowledge of Spark SQL and data frames from previous notebooks in order to explore each of the obtained clusters in terms of its features.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the time of processing this notebook, our Spark cluster contains:  \n",
    "\n",
    "- Eight nodes, with one of them acting as master and the rest as workers.  \n",
    "- Each node contains 8Gb of RAM, with 6Gb being used for each node.  \n",
    "- Each node has a 2.4Ghz Intel dual core processor.  \n",
    "- Running Apache Spark 1.4.1.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the data and creating the RDD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we said, this time we will use the complete dataset provided for the [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), containing nearly half million nework interactions. The file is provided as a Gzip file that we will download locally. Remember that the file must be accessible to every worker in our Spark cluster (in our case we use [NFS](https://en.wikipedia.org/wiki/Network_File_System)).    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import urllib\n",
    "f = urllib.urlretrieve (\"http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz\", \"kddcup.data.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train data size is 4898431\n"
     ]
    }
   ],
   "source": [
    "data_file = \"./kddcup.data.gz\"\n",
    "raw_data = sc.textFile(data_file)\n",
    "\n",
    "print \"Train data size is {}\".format(raw_data.count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) also provide test data that we will load in a separate RDD.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ft = urllib.urlretrieve(\"http://kdd.ics.uci.edu/databases/kddcup99/corrected.gz\", \"corrected.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test data size is 311029\n"
     ]
    }
   ],
   "source": [
    "test_data_file = \"./corrected.gz\"\n",
    "test_raw_data = sc.textFile(test_data_file)\n",
    "\n",
    "print \"Test data size is {}\".format(test_raw_data.count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unsupervised learning with K-Means Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Firs of all, we prepare the data for clustering input. The data contains non-numeric features, and we want to exclude them since k-means works just with numeric features. These are the first three and the last column in each data row that is the label. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to do that, we define a function that we apply to the *RDD* as a `Spark` **transformation** by using `map`. The **action** that actually retrieves the data is `values`. Remember that we can apply as many transofmrations as we want without making `Spark` start any processing. Is is when we trigger an action when all the transformations are applied.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "\n",
    "def parse_interaction(line):\n",
    "    \"\"\"\n",
    "    Parses a network data interaction.\n",
    "    \"\"\"\n",
    "    line_split = line.split(\",\")\n",
    "    clean_line_split = [line_split[0]]+line_split[4:-1]\n",
    "    return (line_split[-1], array([float(x) for x in clean_line_split]))\n",
    "\n",
    "parsed_data = raw_data.map(parse_interaction)\n",
    "parsed_data_values = parsed_data.values().cache()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also standardise our data as we have done so far when performing distance-based clustering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data standardized in 308.888 seconds\n"
     ]
    }
   ],
   "source": [
    "from time import time\n",
    "from pyspark.mllib.feature import StandardScaler\n",
    "\n",
    "standardizer = StandardScaler(True, True)\n",
    "\n",
    "t0 = time()\n",
    "standardizer_model = standardizer.fit(parsed_data_values)\n",
    "tt = time() - t0\n",
    "\n",
    "standardized_data_values = standardizer_model.transform(parsed_data_values)\n",
    "print \"Data standardized in {} seconds\".format(round(tt,3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we are ready to perform k-means clustering. The call to `KMeans.train` triggers the clustering process and returns a list of cluster centroids as well as a model que can use to assign a point to a cluster by calling `predict`. This can be seen in the `error` function declared bellow. There are other parameters for the `train` function, that are explained in the [Spark reference](http://spark.apache.org/docs/latest/mllib-clustering.html#k-means). In our case we will set three of the, the maximum number of iterations for the algorithms, the number of runs, and how do we want to initialise the cluster centroids for the iterative clustering process. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also need to determine a good value for K (the number of clusters for k-means). We will do this visually. We will try with a range of values from 5 to 200, and plot their *cost function* by calling `KMeansModel.computeCost()` for each clustering result. This process will take a while (in the order of minutes). Each time a new value of K has been tried you will see some output printed out. If you want to skip the whole process, you can change the list of values in the for loop to just the best value we have found after this block."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "K=5, cost=13925173.0774, clustered in 127.986 seconds, Cost calculated in 75.817 seconds\n",
      "K=10, cost=12690468.3591, clustered in 165.513 seconds, Cost calculated in 76.243 seconds\n",
      "K=20, cost=11906003.3641, clustered in 207.67 seconds, Cost calculated in 75.278 seconds\n",
      "K=40, cost=10693477.5776, clustered in 271.452 seconds, Cost calculated in 74.855 seconds\n",
      "K=80, cost=6305833.21914, clustered in 382.217 seconds, Cost calculated in 75.535 seconds\n",
      "K=120, cost=5070507.67962, clustered in 471.605 seconds, Cost calculated in 76.267 seconds\n",
      "K=160, cost=4483636.09054, clustered in 553.414 seconds, Cost calculated in 76.276 seconds\n",
      "K=200, cost=4018187.25798, clustered in 632.657 seconds, Cost calculated in 76.087 seconds\n"
     ]
    }
   ],
   "source": [
    "from pyspark.mllib.clustering import KMeans\n",
    "from math import sqrt\n",
    "\n",
    "kmeans_model = {}\n",
    "cost = {}\n",
    "\n",
    "# sample data so the process ends in a reasonable amount of time\n",
    "# if you have a really powerful cluster, you can skip sampling\n",
    "standardized_data_values_sample = standardized_data_values.sample(False, .1, 1234)\n",
    "\n",
    "    \n",
    "for k_value in [5, 10, 20, 40, 80, 120, 160, 200]:    \n",
    "    c_t0 = time()\n",
    "    kmeans_model[k_value] = KMeans.train(\n",
    "            standardized_data_values_sample, \n",
    "            k_value, \n",
    "            maxIterations=20, \n",
    "            runs=10, \n",
    "            initializationMode=\"random\"\n",
    "    )\n",
    "    c_tt = time() - c_t0\n",
    "\n",
    "    w_t0 = time()\n",
    "    cost[k_value] = kmeans_model[k_value].computeCost(standardized_data_values_sample)\n",
    "    w_tt = time() - w_t0\n",
    "    \n",
    "    print \"K={}, cost={}, clustered in {} seconds, Cost calculated in {} seconds\".format(\n",
    "        k_value, cost[k_value], round(c_tt,3), round(w_tt,3))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we build a [Pandas data frame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) that we will use to plot the WSSSE values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7fae70084a90>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAEGCAYAAABvtY4XAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHjJJREFUeJzt3XmcVOWV//HPoRHZRWgEmkVUEFeEgERDjJ3oGMDdxFEU\nEDVKDEZMhoziOIEfgwtjQBNNDAoSJC5xB3mJoyIdNxQxbQeiIqAomyBrMAgCfX5/PNXQNNX0VlW3\nlu/79apXV9W9fetwLY6Hc5/nuebuiIhI9qkXdQAiIpIcSvAiIllKCV5EJEspwYuIZCkleBGRLKUE\nLyKSpVKW4M3sITNba2YLq7HvRDMrjj0Wm9mmVMQoIpJNLFXj4M3sNOAr4GF3P7EGv3c90MPdf5K0\n4EREslDKKnh3fx3YpxI3s6PMbLaZLTCz18ysW5xfvQx4LCVBiohkkfoRf/4DwDB3X2pm3wb+AJxR\nttHMDgc6A69GE56ISOaKLMGbWVPgVOBJMyt7u0GF3S4FnnStpyAiUmNRVvD1gM3u3vMA+1wC/CxF\n8YiIZJUqe/DVHf1iZieb2S4zu6g6H+zu/wQ+NbMfx37fzKx7ueMdAxzq7m9X53giIrKv6lxknQr0\nO9AOZpYHjAdeBKySfR4D3gK6mdkKM7sSuBy42szeBxYB55X7lUvQxVURkVqr1jBJM+sMPF/Z8EYz\nuxH4BjgZmOXuTycwRhERqYU6D5M0s/bA+cD9sbd0QVREJA0kYhz8PcDNsZEuRiUtGhERSa1EjKLp\nBTweG+qYD/Q3s53uPrP8Tmamyl5EpBbcvVaFc50reHc/0t2PcPcjgKeA6yom93L76lGDx+jRoyOP\nIRsfOq86r5n0qIsqK/jY6JfTgXwzWwGMBg6KJexJNfkwdzA1cEREUqLKBO/uA6t7MHe/8kDbP/wQ\njjuuukcTEZG6SOl68M89l8pPy3yFhYVRh5CVdF6TQ+c1/aRyuWA/+WRn/vyUfJyISFYwM7yWF1lT\nuhbNsmWwahW0b5/KTxWRdGO6GBdXogvulLZoBgyAmXHH14hIrol6dEq6PZIhpQn+ggvUhxcRSZWU\n9uC3bnUKCuDzz6FFi5R8rIikoVhfOeow0kpl56QuPfiUVvBNm8Lpp8Ps2an8VBGR3JTSBA9w/vkw\nY0aqP1VEJPekPMGfey68+CLs2JHqTxYRqZ5HH32U3r1706xZMwoKChgwYABvvvlmrY/XuXNnXn01\n9beWTnmCb9MGTjgB5s5N9SeLiFRt4sSJ/OIXv+DWW29l3bp1rFixguHDhzOzDkMAI7vmkMIhQF7m\nrrvchw1zEclR5fNBOtm8ebM3bdrUn3rqqbjbt2/f7iNGjPCCggIvKCjwG2+80Xfs2OHu7l9++aWf\nffbZ3qJFC2/ZsqWfdtppXlpa6oMGDfJ69ep5o0aNvGnTpn7XXXfFPXZl5yT2fq3ybsoreAh9+Jkz\nobQ0ik8XEYlv3rx5bN++nQsvvDDu9ttuu4358+dTUlJCSUkJ8+fPZ9y4cQBMmDCBjh07sn79etat\nW8cdd9yBmTF9+nQ6derErFmz2Lp1KyNHjkzZnyeSBN+1Kxx6KLz7bhSfLiLpziwxj5rasGED+fn5\n1KsXPzU++uij/PrXvyY/P5/8/HxGjx7N9OnTAWjQoAFr1qxh+fLl5OXl0bdv37qcgoSIJMGDJj2J\nSOXcE/OoqVatWrF+/XpKK2kvrF69msMPP3zP606dOrF69WoAfvWrX9GlSxfOOussjjrqKMaPH1+r\nP3siKcGLiMSceuqpHHzwwTz77LNxtxcUFLB8+fI9rz///HMKCgoAaNq0Kb/5zW9YtmwZM2fOZOLE\nicyNjSaJau2dyBJ8r17wz3/C4sVRRSAisq9DDjmEsWPHMnz4cGbMmMG2bdvYuXMns2fP5qabbmLg\nwIGMGzeO9evXs379esaOHcvgwYMBmDVrFkuXLsXdad68OXl5eXtaPW3atGHZsmWp/wPV9upsTR/E\nuUJ83XXu48fHvXAsIlksXj5IJ4888oj37t3bmzRp4m3btvVzzjnH582b59u3b/cbbrjB27Vr5+3a\ntfMRI0bsGUVz9913e+fOnb1JkybeoUMHHzdu3J7jzZgxwzt16uQtWrTwCRMmxP3Mys4JdRhFk9K1\naCp+1ksvwZgx8NZbKQlBRNKE1qLZXzLWook0wX/zTZj49MEH0K5dSsIQkTSgBL+/jF9srKIGDaB/\nf3j++SijEBHJTpEmeNDiYyIiyRJpiwbCSJoOHcKt/Jo1S0koIhIxtWj2l3UtGoDmzaFv37DCpIiI\nJE7kCR406UlEJBkib9EArFkDxx8fJj21bp2ScEQkQlHN7Ex3iW7R1K9zRAnQrh1cfz2ceSa8+iq0\nahV1RCKSTOq/p0ZaVPAQFgYaNSpMfpozJ6w2KSKS6zJ2olNF7jByJLz2Grz8MrRokZLQRETSVtYk\neAhJ/sYb4Z13QjXfvHkKghMRSVNZleAhJPnhw6GkJAyf1Ph4EclVWZfgIdzOb9iwMLJm9mxo0iSJ\nwYmIpKmsTPAQkvzVV8Nnn8GsWdC4cZKCExFJU1mb4AF274ahQ2Ht2nCj7oYNEx+biEi6yuilCqqS\nlwdTp4ax8RdeCDt2RB2RiEhmSPsKvsyuXXDppSHBP/10WGpYRCTbZXUFX6Z+fXjssVDRX3op7NwZ\ndUQiIumtygRvZg+Z2VozW1jJ9svNrMTM/m5mb5pZ98SHGRx0EPzlL+FOUJdfHqp6ERGJrzoV/FSg\n3wG2fwJ8z927A/8DPJCIwCpz8MHw1FOwdSsMGRIuwoqIyP6qTPDu/jqw6QDb57n7ltjLd4AOCYqt\nUg0bwjPPwJdfwpVXKsmLiMST6B781cALCT5mXI0ahVv9rVwJ11wTxsyLiMheCVsu2My+D1wF9K1s\nnzFjxux5XlhYSGFhYZ0+s3HjcMPu/v3huuvg/vuhXsZcNhYR2V9RURFFRUUJOVa1hkmaWWfgeXc/\nsZLt3YFngH7uvrSSfeo0TPJAtm6Ffv2gRw+47z7QvQREJFtEOkzSzDoRkvugypJ7sjVrFtarKS6G\niy8OCV9EJNdVZ5jkY8BbQDczW2FmV5nZMDMbFtvl18ChwP1mVmxm85MYb6WaNw93g2rZEr79bfjo\noyiiEBFJHxkzk7UmJk+GW26BBx4IN/QWEclUWb3YWG3Nnw8//jEMHgxjx4YZsCIimUYJvhLr1oVl\nDQ46CB59VDfzFpHMkxNr0dTGYYeF2/517w69e8Pf/hZ1RCIiqZPVCR7CImV33QXjx8MPfwjTpkUd\nkYhIamR1i6aif/wjrCl/1lkwcaKWHBaR9KcWTTUdfzy8+y6sWAHf/z6sXh11RCIiyZNTCR7gkEPg\n2WfD8gYnnwyvvx51RCIiyZFTLZqKZs8O93u99Va4/notcSAi6UfDJOvgk0/goovghBPCxKjGjaOO\nSERkL/Xg6+DII+Gtt0L1/p3vhIQvIpINcj7BQ6jaH34Yrr4aTj01tG5ERDJdzrdoKnrjDbjkkrC+\n/C23aH15EYmWevAJtnp1WHa4VSuYPj2MvBERiYJ68AlWUABz50KnTmEo5aJFUUckIlJzSvCVaNAg\n3B3q1lvDpKgnnog6IhGRmlGLphqKi8NQyh/9CO68M6xvIyKSCurBp8CGDXDZZbBzJzz+eFipUkQk\n2dSDT4FWreCFF8Iwyt69ww1FRETSmRJ8DeTlwW23we9+B+ecAw8+GHVEIiKVU4umlhYvDksP9+0L\n994LDRtGHZGIZCO1aCLQrRu88w5s2gTf+15YglhEJJ0owddBs2bw5JPh5t59+oSx8yIi6UItmgSZ\nMwcuvxxGjoT/+A8tPSwiiaFhkmnis8/CWPkjj4SHHoKmTaOOSEQynXrwaeLww8NiZY0ahURfWhp1\nRCKSy5TgE6xhQ5g8GTZvht//PupoRCSXqUWTJEuWhElRr70Gxx0XdTQikqnUoklDXbvC7beHC6/f\nfBN1NCKSi1TBJ5E7nHdeuN/rHXdEHY2IZCKNoklj69bBSSfBX/4SJkSJiNSEWjRp7LDDwpo1Q4bA\nli1RRyMiuUQVfIr89KewbVu4ubeISHWpgs8AEybA22/rzlAikjqq4FNo/vywzHBxMbRvH3U0IpIJ\nVMFniD594Oc/h6FDNctVRJKvygRvZg+Z2VozW3iAfX5nZkvMrMTMeiY2xOwyahR89VW4aYiISDJV\np4KfCvSrbKOZDQC6uHtX4Frg/gTFlpXq14c//zncGWrRoqijEZFsVmWCd/fXgU0H2OU8YFps33eA\nFmbWJjHhZaejjoI77wyzXHfsiDoaEclWiejBtwfK389oJdAhAcfNalddFZYV/u//jjoSEclWibrI\nWvEKb24Pl6kGM3jggdCuKSqKOhoRyUb1E3CMVUDHcq87xN7bz5gxY/Y8LywspLCwMAEfn7lat4Yp\nU+CKK6CkBFq0iDoiEYlaUVERRQmq+qo1Dt7MOgPPu/uJcbYNAK539wFmdgpwj7ufEme/nB8HX5nh\nw8P68Y88EnUkIpJukrrYmJk9BpwO5ANrgdHAQQDuPim2z32EkTb/Aq5097/FOY4SfCW2bYNvfQtG\nj4aBA6OORkTSiVaTzALvvQf9+8OCBdCpU9TRiEi60EzWLNCrF9x4o2a5ikjiKMGnkZtuCnd/uvvu\nqCMRkWygFk2a+fTTsGbNnDnQvXvU0YhI1NSiySJHHAF33RVmuW7fHnU0IpLJVMGnIXe4+OJwsXXi\nxKijEZEoaRRNFtqwIdzLddo0OOOMqKMRkaioRZOFWrUKs1yHDoWNG6OORkQykSr4NHfDDbB2LTz+\neFi/RkRyiyr4LDZ+PCxcqGUMRKTmVMFngOJiOOusMMv18MOjjkZEUkkVfJbr2RNGjoQhQ2D37qij\nEZFMoQSfIUaODD8nTIg2DhHJHGrRZJDPPoPeveGll0JVLyLZTy2aHHH44WGdmkGD4Ouvo45GRNKd\nKvgM4w6XXgpt28Jvfxt1NCKSbJrJmmM2bgyzXKdMCaNrRCR7qUWTY1q2hD/9Ca66KixpICISjyr4\nDPbLX8Lnn8OTT2qWq0i2UgWfo26/HRYvhocfjjoSEUlHquAzXEkJnHkmzJ8f1pIXkeyiCj6HnXRS\nuNXf4MGa5Soi+1KCzwK//CU0aAD/+79RRyIi6UQtmiyxYgX06gWzZ4efIpId1KIROnYME58GDYJt\n26KORkTSgSr4LHPZZWGc/H33RR2JiCSCZrLKHps2QY8e8Mc/Qv/+UUcjInWlBC/7mDs3tGpKSiA/\nP+poRKQulOBlP7/6FSxdCs88o1muIplMF1llP+PGwSefwNSpUUciIlFRBZ/FFi6EH/wA3n4bjjoq\n6mhEpDZUwUtcJ54It9wSZrnu2hV1NCKSakrwWW7ECGjcGO64I+pIRCTV1KLJAStXhtmtzz8PffpE\nHY2I1IRaNHJAHTrAvfeGoZP/+lfU0YhIqqiCzyFDhkCTJnD//VFHIiLVpXHwUi1btoTlhe+7D845\nJ+poRKQ6ktqiMbN+ZvaRmS0xs5vibM83sxfN7H0zW2RmQ2sTiCTfIYeEuz9dey2sWxd1NCKSbAes\n4M0sD1gMnAmsAt4FBrr7h+X2GQMc7O6jzCw/tn8bd99V4Viq4NPEzTfDBx/AjBma5SqS7pJZwfcB\nlrr7cnffCTwOnF9hnzVA89jz5sCGisld0svYsWH9+MmTo45ERJKpqgTfHlhR7vXK2HvlPQgcb2ar\ngRJgROLCk2Ro0AAeeQRGjYIlS6KORkSSpaoEX52eyi3A++5eAPQAfm9mzeocmSTVccfB6NFh6OTO\nnVFHIyLJUL+K7auAjuVedyRU8eV9B7gNwN2XmdmnQDdgQcWDjRkzZs/zwsJCCgsLaxywJM7w4TBr\nFtx2G5T7TyMiESoqKqKoqCghx6rqImt9wkXTM4DVwHz2v8g6Edji7v/PzNoA7wHd3X1jhWPpImsa\nWr0aevYMF1xPOSXqaESkoqRdZI1dLL0e+D/gA+Av7v6hmQ0zs2Gx3W4HeptZCfAK8J8Vk7ukr4IC\n+MMfQqvmq6+ijkZEEkkTnQSAK6+Egw6CBx6IOhIRKU9r0Uid/fa38MorMHNm1JGISKKogpc93ngD\nLr4YiouhbduooxER0Fo0kkD/9V/w/vthdI1muYpETy0aSZjRo2HtWpg0KepIRKSuVMHLfj76CL77\nXXjzTejWLepoRHKbKnhJqGOOCevVaJarSGZTgpe4rrsOWrcOiV5EMpNaNFKpL76AHj3gmWfgO9+J\nOhqR3KQWjSRF27bwxz/C4MGwdWvU0YhITamClyr95CfgDlOmRB2JSO5RBS9Jdc898Ne/hlaNiGQO\nVfBSLfPmwYUXhlmu7dpFHY1I7lAFL0l36qkwbBhcdVVo14hI+lOCl2q79VbYsCEsLywi6U8tGqmR\njz+Gvn3htdfg2GOjjkYk+6lFIylz9NEwbhxcfjl8803U0YjIgaiClxpzh/POgxNPhNtvjzoakeym\n5YIl5dauDbNcn3gCTjst6mhEspdaNJJybdqE2/sNGQJbtkQdjYjEowpe6mTYMNi+HaZNizoSkeyk\nCl4iM3FimAT15JNRRyIiFamClzqbPx/OPRf+9jdo3z7qaESyiyp4iVSfPjB8OFx5JZSWRh2NiJRR\ngpeEuOWWsKTwvfdGHYmIlFGLRhJm6dKwZk1RERx/fNTRiGQHtWgkLXTpAnfcEWa57tgRdTQiogpe\nEso9LCvcrRuMHx91NCKZTzNZJa18+SWcdBI89hicfnrU0YhkNrVoJK20bg2TJ4dZrps3Rx2NSO5S\nBS9J87OfwT//CX/+c9SRiGQuVfCSln7zG1iwAB5/POpIRHKTKnhJqgULYMAAeO896Ngx6mhEMo8q\neElbvXvDiBEwdCjs3h11NCK5RQleku7mm6FxYzjiCBg7FlatijoikdygBC9Jl5cHzz8PM2fCF1+E\nO0FdcAHMnq2qXiSZ1IOXlPvqq3DhddKkMGb+Jz+Bq66CgoKoIxNJP0ntwZtZPzP7yMyWmNlNlexT\naGbFZrbIzIpqE4jkjqZNQ1J/91145hlYuTKsXXPhhfDii1qRUiRRDljBm1kesBg4E1gFvAsMdPcP\ny+3TAngT+KG7rzSzfHdfH+dYquClUlu3hpmvkybBxo1wzTWhqm/bNurIRKKVzAq+D7DU3Ze7+07g\nceD8CvtcBjzt7isB4iV3kao0awbXXhuGUz71FCxfDsceCz/6Ebz0kqp6kdqoKsG3B1aUe70y9l55\nXYGWZjbXzBaY2eBEBii5p1evcEPvzz6Df/s3uOkm6NoV7rwT1q6NOjqRzFG/iu3V6akcBHwLOANo\nDMwzs7fdfUnFHceMGbPneWFhIYWFhdUOVHJP8+bw05+GG3svWBDaN8ccA2eeGd77wQ+gnsaBSZYp\nKiqiqKgoIceqqgd/CjDG3fvFXo8CSt19fLl9bgIaufuY2OvJwIvu/lSFY6kHL3W2ZQs88khI9v/6\nV2jrDB0Khx0WdWQiyZHMHvwCoKuZdTazBsAlwMwK+8wAvmtmeWbWGPg28EFtghGpyiGHhEXM3n8/\nJPqPPgprz19yCbz6aliPXkSCKsfBm1l/4B4gD5ji7neY2TAAd58U22ckcCVQCjzo7r+LcxxV8JIU\nmzeHFSsnTQp3krr2WrjiirBssUim0w0/RAjV+9tvh0T/3HPQv3/o1Z9+Olit/nqIRE8JXqSCTZv2\nVvW7doWqfsgQyM+POjKRmlGCF6mEO7z1Vkj0M2fC2WeHqv6001TVS2ZQghepho0bYfr0kOzd9/bq\nW7aMOjKRyinBi9SAO7zxRkj0s2bBueeGqr5vX1X1kn6U4EVqacMGePjhkOzz8vb26g89NOrIRAIl\neJE6cofXXguJ/oUX4PzzQ1V/6qmq6iVaSvAiCbR+PUybFtbDadAgVPWDB0OLFlFHJrlICV4kCdyh\nqChU9S++GNarv/ZaOOUUVfWSOkrwIkn25Zfwpz+Fqr5Ro9C+GTQoLJ0gkkxK8CIpUloKc+eGqv7l\nl+Gii0JV36ePqnpJDiV4kQisXbu3qm/WLFT1l18eljkWSRQleJEIlZbCnDmhqp8zB37841DV9+6t\nql7qTgleJE188QVMnQoPPhhG3QwbBpddFip8kdpQghdJM6Wl8Moroap/9VW4+OKQ7Hv1ijoyyTRK\n8CJpbM0aeOihUNXn54dEP3AgNG0adWSSCZTgRTLA7t1h5M2kSfDXv8IFF4SZsj17wgknQMOGUUco\n6UgJXiTDrFoFTz8N770Xbj+4ZAl06QI9eoSE37NneK7Zs6IEL5Lhtm+Hf/wDiov3Pv7+93Az8fJJ\nv2dPKCjQ6JxcogQvkoV274alS/dN+sXFIbmXVfhlSb9rV6hXL+qIJRmU4EVyhDusXr1/0l+/Hrp3\n3zfxn3ACHHxw1BFLXSnBi+S4TZugpGTfpL90KRx99L5Jv0cPrZ+TaZTgRWQ/X38NixaFi7hlSX/h\nQmjTZt8LuT17Qrt26uunKyV4EamW3bvh44/3TfrFxeFuVuUv5PboEUb1qK8fPSV4Eak1d1i5MiT6\n8ol/48a9ff2ypH/88errp5oSvIgk3MaNexN+2c9ly6Bbt32Tfo8eWkEzmZTgRSQlvv469PHLqvz3\n3w+v27Xbv8XTrl3U0WYHJXgRicyuXaGvX3HoZoMG+yb9nj3hyCPV168pJXgRSSvusGLF/kl/82Y4\n6aR9k/5xx4X/GUh8SvAikhE2bNh/BM+nn8Ixx4Rkf+KJocffrRt07hxG9+Q6JXgRyVjbtu3t6y9a\nFNo9ixeHWyIeeeTehN+tW5i41a1bWHY5VyjBi0jW2bYtrLJZlvDLP+rX3zfhlz3v0iX7ll1WgheR\nnOEO69aFRF8x+X/2WVhtM17y79AhM2frKsGLiAA7d4aefrzk/9VXYdXNeMk/ncfxK8GLiFRhy5a9\nSb988l+yJCT4eL3+I44I7aAoKcGLiNRSaWm4w1ZZwi+f/NesCaN54iX/1q1T0/JJaoI3s37APUAe\nMNndx1ey38nAPODf3f2ZONuV4EUko3z9dVieIV7yd9+31VOW/Lt2hUaNEhdD0hK8meUBi4EzgVXA\nu8BAd/8wzn4vA9uAqe7+dJxjKcHXUFFREYWFhVGHkXV0XpMjl86re7jJSsU+/8cfwyefQNu28Xv9\nHTvWfCZvXRJ8Vd2lPsBSd18e+6DHgfOBDyvs93PgKeDk2gQh8eXSX5hU0nlNjlw6r2ahRdO6NfTt\nu++2Xbtg+fK9yX/RonCD9Y8/DjN5u3SJn/yTcYP1qhJ8e2BFudcrgW+X38HM2hOS/g8ICV5luojk\nrPr1QxLv0gUGDNh329atIdGXJf8XXoC77w6vmzTZv8/frVsdY6lie3WS9T3Aze7uZmZABo40FRFJ\nvmbNoFev8Civ7F675fv8c+eGn3VRVQ/+FGCMu/eLvR4FlJa/0Gpmn7A3qecT+vDXuPvMCsdSZS8i\nUgvJushan3CR9QxgNTCfOBdZy+0/FXg+3igaERFJrQO2aNx9l5ldD/wfYZjkFHf/0MyGxbZPSkGM\nIiJSCymb6CQiIqmle6ukETNbbmZ/N7NiM5sfe6+lmb1sZh+b2UtmloTBVNnDzB4ys7VmtrDce5We\nQzMbZWZLzOwjMzsrmqjTXyXndYyZrYx9X4vNrH+5bTqv1WBmHc1srpn9w8wWmdkNsfcT8p1Vgk8v\nDhS6e0937xN772bgZXc/GpgTey2Vmwr0q/Be3HNoZscBlwDHxX7nD2amvxPxxTuvDkyMfV97uvts\n0HmtoZ3AL9z9eOAUYLiZHUuCvrM66emn4tXy84BpsefTgAtSG05mcffXgU0V3q7sHJ4PPObuO2OT\n+ZYSJvdJBZWcV4g/LFrntZrc/Qt3fz/2/CvCJNL2JOg7qwSfXhx4xcwWmNk1sffauPva2PO1QJto\nQstolZ3DAsLkvTIrCX+5pPp+bmYlZjalXBtB57UWzKwz0BN4hwR9Z5Xg00tfd+8J9Cf8U+208htj\ni/noqngdVOMc6vxW3/3AEUAPYA0w4QD76rwegJk1BZ4GRrj71vLb6vKdVYJPI+6+JvbzS+BZwj+9\n1ppZWwAzawesiy7CjFXZOVwFdCy3X4fYe1IN7r7OY4DJ7G0V6LzWgJkdREju0939udjbCfnOKsGn\nCTNrbGbNYs+bAGcBC4GZwBWx3a4Anot/BDmAys7hTOBSM2tgZkcAXQmT+aQaYomnzIWE7yvovFZb\nbHmXKcAH7n5PuU0J+c5GfK8SKacN8Gz470194BF3f8nMFgBPmNnVwHLg36MLMf2Z2WPA6UC+ma0A\nfg3cSZxz6O4fmNkTwAfALuBnWtM6vjjndTRQaGY9CC2CT4GyCZA6r9XXFxgE/N3MimPvjSJB31lN\ndBIRyVJq0YiIZCkleBGRLKUELyKSpZTgRUSylBK8iEiWUoIXEclSSvAiIllKCV5EJEv9fzeQCh45\nrZ6YAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7fae697076d0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "kmeans_results = pd.DataFrame(\n",
    "    {'Cost' : cost.values()},\n",
    "    index = cost.keys()\n",
    ").sort_index()\n",
    "\n",
    "kmeans_results.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We know that Cost will tend to decrease as far as we increase the number of clusters `k`, but at some point the benefit doesn't improve the clustering result regarding how well the centroids represent our sample distribution. In this case seems that `k=80` might be a good value.   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So let's use this value and cluster the **complete** dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data clustered in 2305.673 seconds\n"
     ]
    }
   ],
   "source": [
    "t0 = time()\n",
    "clusters = KMeans.train(\n",
    "        standardized_data_values, \n",
    "        80, \n",
    "        maxIterations=15, \n",
    "        runs=10, \n",
    "        initializationMode=\"random\"\n",
    ")\n",
    "tt = time() - t0\n",
    "\n",
    "print \"Data clustered in {} seconds\".format(round(tt,3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualising clusters with the help of PCA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section we want to do two things. First we want to visualise the clusters in a two dimensional space. By doing so we will get a feeling of cluster sizes and proximity. The second thing we want to do is to find out the main discriminant variables in our dataset. [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) will help to get both things done."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "*Currently PCA is not available for pySpark.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explaining the biggest clusters using Spark SQL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will end up this exploratory data analyisis with Spark by having a look at tag composition and feature values for the five bigger clusters. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first thing we need to do is to add the cluster ID to each sample, so later on we can query by that value. We proceed as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
